1.a) Tidyeval and functional programming

Published

Tue, 30 of April, 2024

Modified

Tue, 30 of April, 2024

Caution

Web page construction in progress…

Learning resources

Explanation

Premise: tidyverse functions use tidy evaluation (= they don’t evaluate the value of a variable right away! = Non-Standard evaluation).

  • (+) This means you can do some intermediate transformation to the variable in abstract (e.g. to a generic “column” thing)
  • (-) it’s hard to refer to variables indirectly, and hence harder to program with

In contrast, normal/base/custom R functions DO evaluate objects (i.e. a+b) as soon as possible = Standard evaluation

So, to take full advantage of Non-Standard evaluation (more interactivity, but also writing custom functions), I will need a sort of METAVARIABLE (a “quosure”), i.e. something that doesn’t get evaluated until I tell so.

NON STANDARD EVALUATION in TIDYVERSE

  • DEFUSING (DELAYING) function arguments: I can create a “quosure” with rlang::enquo() / rlang::enquos() so an expression can be examined, modified, and injected into other expressions.

TWO (complementary) FORMS of NSE used in the Tidyverse

1) TIDY SELECTION is used in SELECTION VERBS

e.g. in dplyr::select() across(), relocate(), rename() and pull() use tidy selection where expressions are either interpreted in the context of the data frame (e.g. c(cyl, am) or evaluated in the user environment (e.g. all_of(), starts_with())

# EXE `across()`
summarise_mean <- function(data, vars) {
    # all variables selected by user... 
    data %>% summarise(n = n(), across({{ vars }}, mean))
}
# call 
starwars %>% 
    group_by(homeworld) %>% 
    # with where
    summarise_mean(where(is.numeric))

2) DATA MASKING used in ACTION VERBS

ACTION VERBS = dplyr::mutate(), ggplot2::aes(), arrange(), count(), filter(), group_by(), and summarise().

Normal interactive programming (tidyverse) use data-masking, which allow you to use variables in the “current” data frame without any extra syntax. This:

  • (+) makes it nicer to interactively work (no extra typing of data$column, just column), but
  • (-) makes it harder to create your own functions (it could be ambiguous what is a data-variable and what is an env-variable).
## ---  tidyverse non std eval 
starwars %>% 
 filter (homeworld == "Tatooine")  

SOOOOO We need some way to add $ back into the picture. Passing data-masked arguments to functions requires INJECTION (= quasiquotation), i.e. TO INJECT A FUNCTION ARGUMENT IN A DATA-MASKED CONTEST, YOU EMBRACE IT WITH {{

Inside data-masking function (actions), we can use injection operators:

    + `{{` embracing operator (`rlang`)
    + `!!` operator (`base`) 
    + `.data` pronouns. 
    + `.env` pronouns.

Defuse with “embracing” {variable } (inside custom f)

## --- own function with tidyverse non std eval ACTION VERB `group_by()`
my_summarise <- function(data, group_var) {
  data %>%
# tell to inject whatever argument supplied to the function (homeworld) 
        # **in place** of "group_var" 
    group_by({{ group_var }}) %>%
    summarise(mean = mean(height))
}

# call(s) 
my_summarise (starwars, homeworld) 
my_summarise (starwars, sex) 
Warning

Without {group_var} I would get the error
“! Must group by variables found in .data. ✖ Column group_var is not found.”

Different options

1 Defuse (nothing!) + Inject {{ (inside custom f)

# -------- OR 
grouped_mean_1 <- function(df, group_var, summarize_var) {
    df %>% 
# Defuse and inject in a single step with the embracing operator
    group_by({{group_var}} ) %>% 
    summarize(mean = mean({{summarize_var}} , na.rm = TRUE))
}

# call
grouped_mean_1(
  df = starwars, 
  group_var = sex, 
  summarize_var =  height 
)

2 Defuse enquo + Inject !! (inside custom f)

# We can tell group_by() not to quote by using !! (pronounced “bang bang”). !! says something like “evaluate me!” or “unquote!”test 

grouped_mean_2 <- function(df, group_var, summarize_var) {
    ## -- Defuse the user expression in `*_var`
    group_var = enquo(group_var)
    summarize_var = enquo(summarize_var)

  df %>% 
    ## -- Inject the expression contained in `*_var` 
    group_by(!!group_var) %>% 
    summarize(mean = mean(!!summarize_var, na.rm = TRUE))
}

# call
grouped_mean_2(
  df = starwars, 
  group_var = sex, 
  summarize_var =  height 
)

3 Defuse ... + Inject ...

In this case, summarize_var goes in front and ... last

  • ... can stand for multiple variables
# ---- func 
grouped_mean_3 <- function(df,  summarize_var, ...) {
  
    ## -- Defuse the summarize_var = enquo(summarize_var) 
    ## ... group_var >>>> NO NEED FOR ENQUO with ... !
    summarize_var = enquo(summarize_var)
    
    df %>% 
        group_by(...) %>% 
        summarize(mean = mean(!!summarize_var, na.rm = TRUE))
}

# ---- call
grouped_mean_3(
    df = starwars, 
    sex, homeworld, # (...)
    summarize_var =  height 
)
Tip

{...} Basically we are saying “everything I throw at the function will be carried along until I want to evaluate it”

Different options (with left side eq)

1b (nothing!) + {{ & left side := !!!!

  • Super compact left side syntax with "sometext_{{group_var}}" :=
# --- func
grouped_mean_1b <- function(df, group_var, summarize_var) {
    df %>% 
        # Defuse and inject in a single step with the embracing operator
        group_by({{group_var}} ) %>% 
        summarize( "BY_{{group_var}}" := mean({{summarize_var}} , na.rm = TRUE))
}

# --- call
grouped_mean_1b (
    df = starwars, 
    group_var = sex, 
    summarize_var =  height 
)

2b enquo + !! & left side :=

2 things needed here :

+  `as_label(enquo(____var))`
+  left side syntax with `!!str_c("Mean_", ____var) :=`
# --- func
grouped_mean_2b <- function(df, group_var, summarize_var) {
    ## -- Defuse the user expression in `*_var`
    group_var = enquo(group_var)
    summarize_var = as_label(enquo(summarize_var)) # as_label(enquo !!!!!
    
    df %>% 
        ## -- Inject the expression contained in `*_var` 
        group_by(!!group_var) %>% 
        summarize(!!str_c("Mean_", summarize_var) := mean(!!summarize_var, na.rm = TRUE))
}

# --- call
grouped_mean_2b(df = starwars, 
                     group_var = sex, 
                     summarize_var =  height 
)

3b ... + ... & left side :=

## --  define function 
grouped_mean_3b <- function(df,  summarize_var, ...) {
#   group_var = ...  NO NEED FOR ENQUO!
    summarize_var = enquo(summarize_var)
    summarize_var_name <- as_label(enquo(summarize_var))
    
    df %>% 
        group_by(...) %>% 
    #   summarize(!!summarize_var_name := mean(!!summarize_var, na.rm = TRUE))
    # or  
    summarize(!!str_c("My_mean_", summarize_var_name) := mean(!!summarize_var, na.rm = TRUE))
    # ERRORE ?!?!?!?
    # summarize(str_c("Mean_", !!summarize_var_name) := mean(!!summarize_var, na.rm = TRUE))
}

## --  call function
grouped_mean_3b(df = starwars, 
                     sex, homeworld, #  group_var
                     summarize_var =  height 
)
Warning

OKKIO!!! Strange enough… seems like the unquoting must be of the WHOLE left-side of the equation not just of the quoted variable as I thought + !!summarize_var_name := ... OK + !!str_c("Mean_", summarize_var_name) := ... OK: xchè?????? + str_c("Mean_", !!summarize_var_name) := ... WRONG: xchè??????

Using .data

It’s good practice to prefix named arguments with a . (.data)to reduce the risk of conflicts between your arguments and the arguments passed to ...

## --  define function 
grouped_mean_4 <- function(data,  summarize_var, ...) {
  
#   group_var = enquo(group_var) NO NEED FOR ENQUO!
    summarize_var = enquo(summarize_var)
    summarize_var_name <- as_label(enquo(summarize_var))

    data %>% 
        group_by(...) %>% 
    #   summarize(!!summarize_var_name := mean(!!summarize_var, na.rm = TRUE))
    # or even better 
    summarize(!!str_c("Mean_", summarize_var_name) := mean(!!summarize_var, na.rm = TRUE))
    
}

## --  call function
grouped_mean_4(
    data = starwars, 
    summarize_var =  height ,
    sex, homeworld
)

Examples

starwars <- starwars

# I FORWARD a (masked)  argument with DOUBLE EMBRACE 
my_summarise <- function(data, var) {
  data %>% dplyr::summarise(Mean =mean( {{ var }}, na.rm = TRUE ))
}

## --  call function
call <- my_summarise (starwars, height)

# The .data pronoun is a tidy eval feature that is enabled in all data-masked arguments, just like {{
my_summarise2 <- function(data, var) {
  data %>% dplyr::summarise(mean = mean(.data[[var]], na.rm = TRUE ))
}

call2 <- my_summarise2 (starwars, "height")
# # ------- 1/3 FUNZIONE: raggruppo (2 gruppi) e calcolo qualche proporzione 
# # https://www.youtube.com/watch?v=pcvWKVlRmwE
# 
# f_prop_grouping <- function(start_df, end_df, group_var1, group_var2, dummy1, dummy2  ) {
#   
#   end_df <- start_df %>% 
#     # grouping var(s)
#     group_by( {{group_var1}} ,  {{group_var2}} # misura e stato
#     ) %>% 
#     summarise(n_group = n(),# n_stato
#               TOT_dummy1 = sum({{dummy1}}, na.rm = TRUE ), # SUM_dummy1
#               TOT_dummy2 = sum({{dummy2}}, na.rm = TRUE ) # SUM_dummy2
#     ) %>% 
#     mutate(N_group = sum(n_group), # N_stato
#            Perc_group =  paste0(round(n_group/N_group,3)*100 ,"%"), # %_stato
#            Perc_group_dummy1 = paste0(round(TOT_dummy1/n_group,3)*100 ,"%"), # % dummy over n(group)
#            Perc_group_dummy2 = paste0(round(TOT_dummy2/n_group,3)*100 ,"%"),
#            
#     ) }

# run
# end_df <- f_prop_grouping(AL_anagr_stato_t, end_df, misura, stato, inizio_past, inizio_rit )

# ------- 2/3 FUNZIONE: raggruppo (2 gruppi) e calcolo qualche proporzione 
f_prop_grouping2 <- function(start_df, end_df, group_var1, group_var2, dummy1, dummy2 ) {  
  # This to use the "walrus operator" := on the LEFT (naming the derived vars )
  dummy1name <- as_label(enquo(dummy1))
  dummy2name <- as_label(enquo(dummy2))
 
  
  end_df <- start_df %>% 
    # grouping var(s)
    group_by( {{group_var1}}, {{group_var2}} # misura e stato
    ) %>% 
    summarise(n_group = n(),# n_stato
              !!str_c("TOT_", dummy1name) := sum({{dummy1}}, na.rm = TRUE ), # SUM_dummy1
              !!str_c("PERC_", dummy1name) := paste0(round(sum({{dummy1}}, na.rm = TRUE)/n_group,3)*100 ,"%"), # % dummy over n(group)
              
              !!str_c("TOT_", dummy2name) := sum({{dummy2}}, na.rm = TRUE ),  # SUM_dummy2
              !!str_c("PERC_", dummy2name) := paste0(round(sum({{dummy2}}, na.rm = TRUE)/n_group,3)*100 ,"%")   # % dummy over n(group) 
               ) %>% 
    # ? NOT SURE WHY HERE IT ONLY CONSIDERS 1ST GROUPING VARIABLE
    mutate (N_group = sum(n_group),
            Perc_group =  paste0(round(n_group/N_group,3)*100 ,"%")  # %_stato
    ) %>% 
    relocate (c("N_group","Perc_group" ), .before = !!str_c("TOT_", dummy1name))  
}

# # ------- 3/3 FUNZIONE: raggruppo (2 gruppi) e calcolo qualche proporzione 
# # using [“enquo” + “!!”  ]      |  "syms" function and the “!!!”  (for multiple vars) 
# f_prop_grouping3 <- function(start_df, end_df, group_vars, dummy1, dummy2  ) {
#   # define the list of group_by vars "syms" 
#   group_vars <- syms(group_vars)
#   # This to use the "walrus operator" := on the LEFT (naming the created var  
#   dummy1name <- as_label(enquo(dummy1))
#   dummy2name <- as_label(enquo(dummy2))
#   
#   end_df <- start_df %>% 
#     # call grouping var(s) “!!!” 
#     group_by( !!!group_vars ) %>%    
#     summarise(n_group = n(),# n_stato
#               # TOT_dummy1 = sum({{dummy1}}, na.rm = TRUE ), # SUM_dummy1
#               # TOT_dummy2 = sum({{dummy2}}, na.rm = TRUE ) # SUM_dummy2
#               !!str_c("TOT_", dummy1name) := sum({{dummy1}}, na.rm = TRUE ), # SUM_dummy1
#               !!str_c("PERC_", dummy1name) := paste0(round(sum({{dummy1}}, na.rm = TRUE)/n_group,3)*100 ,"%"), # % dummy over n(group)
#               
#               !!str_c("TOT_", dummy2name) := sum({{dummy2}}, na.rm = TRUE ),  # SUM_dummy2
#               !!str_c("PERC_", dummy2name) := paste0(round(sum({{dummy2}}, na.rm = TRUE)/n_group,3)*100 ,"%") # % dummy over n(group) 
#               ) %>% 
#     # ? NOT SURE WHY HERE IT ONLY CONSIDERS 1ST GROUPING VARIABLE
#     mutate (N_group = sum(n_group),
#             Perc_group =  paste0(round(n_group/N_group,3)*100 ,"%")  # %_stato
#     ) %>% 
#     relocate (c("N_group","Perc_group" ), .before = !!str_c("TOT_", dummy1name))  
#     
#   
# }


# ------- 2/3 FUNZIONE: raggruppo (2 gruppi) e calcolo qualche MEDIA
f_mean_grouping2 <- function(start_df, end_df, group_var1, group_var2, numer1, numer2) {  
  # This to use the "walrus operator" := on the LEFT (naming the derived vars )
  numer1name <- as_label(enquo(numer1))
  numer2name <- as_label(enquo(numer2))
  
  # operations
  end_df <- start_df %>% 
    # grouping var(s)
    group_by( {{group_var1}}, {{group_var2}} # misura e stato
    ) %>% 
    summarise(n_group = n(),# n_stato
              !!str_c("Media_", numer1name) := mean({{numer1}}, na.rm = TRUE),
              !!str_c("Media_", numer2name) := mean({{numer2}}, na.rm = TRUE)
              ) %>% 
    # ? NOT SURE WHY HERE IT ONLY CONSIDERS 1ST GROUPING VARIABLE
    mutate (N_group = sum(n_group),
            Perc_group =  paste0(round(n_group/N_group,3)*100 ,"%")  # %_stato
    ) %>% 
    relocate (c("N_group","Perc_group" ), .before = !!str_c("Media_", numer1name) )
}

# OKKIO!!!!!! 
#end_df <- f_prop_grouping2(AL_anagr_stato_t, end_df, c("misura", "stato"), inizio_past, inizio_rit )


# f_prop_grouping2 <- function(start_df, end_df, group_var1, group_var2, ... ) {
#   
#   end_df <- start_df %>% 
#     # grouping var(s)
#     group_by( {{group_var1}} ,  {{group_var2}} # misura e stato
#     ) %>% 
#     summarise(n_group = n(),# n_stato
#               TOT_dummy1 = sum(..., na.rm = TRUE ), # SUM_dummy1
#               TOT_dummy2 = sum(..., na.rm = TRUE ) # SUM_dummy2
#     ) %>% 
#     mutate(N_group = sum(n_group), # N_stato
#            Perc_group =  paste0(round(n_group/N_group,3)*100 ,"%"), # %_stato
#            Perc_group_dummy1 = paste0(round(TOT_dummy1/n_group,3)*100 ,"%"), # % dummy over n(group)
#            Perc_group_dummy2 = paste0(round(TOT_dummy2/n_group,3)*100 ,"%"),
#            
#     ) }



# ------- FUNZIONE: generalizzo il nome della fase (prefix) 
f_rimuovo_pref <- function(data, prefix = "word_"){
  rename_with(.data = data, 
              .cols = dplyr::starts_with(prefix), # (default e' everything() e le pescava comunque)
              # rinomino le date eliminando il prefisso
              .fn = function(x)sub(prefix,"",x))
  
}

# EXE uso
#BC_PROGETTAZIONE_temp <- f_rimuovo_pref(BC_PROGETTAZIONE , prefix = "PROG_ESEC_") 


# -------- FUNZIONE: introduco qualche calcolo sulle date delle fasi procedurali
f_calcoli_date <- function(data, inizio_fase_prev, inizio_fase_eff, fine_fase_prev, fine_fase_eff) {
  dplyr::mutate(data,
                durata_prev =  {{fine_fase_prev}} - {{inizio_fase_prev}} ,
                #durata_eff =   {{fine_eff}} - {{inizio_eff}} ,
                inizio_V_today = case_when(
                  {{inizio_fase_prev}} <= today() ~ "pre_oggi",
                  {{inizio_fase_prev}} > today() ~  "post_oggi",
                  TRUE ~ "Ignoto"),
                inizio_discrep = {{inizio_fase_eff}} - {{inizio_fase_prev}} ,
                inizio_ritardo = case_when(
                  inizio_V_today == "Ignoto" ~ "[Fase non prevista]",
                  inizio_V_today == "post_oggi" ~ "Inizio previsto futuro",
                  inizio_V_today == "pre_oggi" & !is.na({{inizio_fase_eff}}) ~ if_else(
                    inizio_discrep > 0 , glue("rit = {inizio_discrep} gg"), glue("ant = {inizio_discrep} gg")
                  ),
                  inizio_V_today == "pre_oggi" & is.na({{inizio_fase_eff}}) ~  "No inizio effettivo")
  )
}

# EXE uso
# BC_PROGETTAZIONE_calc <-  f_calcoli_date (
#   data = BC_PROGETTAZIONE_temp , 
#   inizio_fase_prev, inizio_fase_eff, fine_fase_prev, fine_fase_eff)

MORE

https://jonthegeek.com/2018/06/04/writing-custom-tidyverse-functions/